Add columnar data access for memory-efficient row processing by jayantsing-db · Pull Request #975 · databricks/databricks-jdbc

jayantsing-db · 2025-09-02T05:50:35Z

Description

This PR also contains the changes from PR #966.

Introduce ColumnarRowView to provide direct access to columnar data without
materialising entire result sets into row objects. This reduces memory
allocations by allowing individual cell access via getValue(row, col)
instead of creating List<List<Object>> structures.

Key changes:

New ColumnarRowView class with direct columnar access methods.
Updated LazyThriftResult to use columnar view instead of materialized rows.
Added utility method in DatabricksThriftUtil for creating columnar views.
Comprehensive test coverage for all column types and null handling.

This optimization maintains API compatibility while significantly reducing
memory overhead for large result sets.
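
A minimal sketch of the idea, not the driver's actual code: it contrasts copying every cell into a `List<List<Object>>` with reading cells in place from per-column lists. The class `ColumnarViewSketch` and its helper methods are hypothetical and exist only for illustration. The only name taken from the description above is `getValue(row, col)`.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration; not the ColumnarRowView class added in this PR. */
final class ColumnarViewSketch {
  private final List<List<?>> columns; // one list per column, kept as received
  private final int rowCount;

  ColumnarViewSketch(List<List<?>> columns) {
    this.columns = columns;
    this.rowCount = columns.isEmpty() ? 0 : columns.get(0).size();
  }

  int getRowCount() {
    return rowCount;
  }

  /** Reads one cell in place; no per-row list is allocated. */
  Object getValue(int row, int col) {
    if (row < 0 || row >= rowCount) {
      throw new IndexOutOfBoundsException("row " + row);
    }
    return columns.get(col).get(row);
  }

  /** The approach being replaced: copies every cell into a new row list. */
  static List<List<Object>> materializeRows(List<List<?>> columns) {
    int rows = columns.isEmpty() ? 0 : columns.get(0).size();
    List<List<Object>> out = new ArrayList<>(rows);
    for (int r = 0; r < rows; r++) {
      List<Object> row = new ArrayList<>(columns.size());
      for (List<?> column : columns) {
        row.add(column.get(r));
      }
      out.add(row);
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<?>> columns = new ArrayList<>();
    columns.add(Arrays.asList(1, 2, 3));
    columns.add(Arrays.asList("a", "b", "c"));

    ColumnarViewSketch view = new ColumnarViewSketch(columns);
    System.out.println(view.getValue(1, 1)); // cell at row 1, column 1
    System.out.println(materializeRows(columns).get(1)); // same row, but copied
  }
}
```

In this sketch the view allocates nothing per row, while `materializeRows` allocates one list per row plus the outer list, which is the kind of overhead the description refers to.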

Following the changes introduced in PR #966, the following improvements were
observed during a test that executes a SQL query retrieving 5 million rows:

Current heap usage over time:

Improved heap usage over time:

The first image shows multiple significant CPU usage spikes reaching around 15%, while the second image shows consistently flat CPU usage at approximately 3%.
The erratic CPU behavior in the "before" state has been completely smoothed out.
CPU usage is now consistent and predictable rather than volatile.
Memory usage dropped from a peak of 8440 MB to just 745 MB, roughly a 91% decrease.
The large memory spike and sustained high usage in the first image have been completely resolved.
Memory usage is now consistently low and stable throughout the monitoring period.
There is no negative effect on execution time; in fact, it appears to improve, likely due to more active GC pauses in the previous state.

Testing

Unit tests
e2e tests
FakeService tests

Additional Notes to the Reviewer

This reverts commit 76e36e3.


gopalldb · 2025-09-05T06:29:02Z

+  }
+
+  /** Interface for accessing column values by index without materializing the entire column. */
+  private interface ColumnAccessor {


use separate files for interface and impl

gopalldb · 2025-09-05T06:51:23Z

+    if (column.isSetStringVal()) return column.getStringVal().getValuesSize();
+
+    throw new DatabricksSQLException(
+        "Unsupported column type: " + column, DatabricksDriverErrorCode.UNSUPPORTED_OPERATION);


what about complex datatypes? Will they also be covered in above primitive types?

We only support the column types handled by `getColumnValues` (databricks-jdbc/src/main/java/com/databricks/jdbc/common/util/DatabricksThriftUtil.java, line 230 in 970c4c8):

private static List<?> getColumnValues(TColumn column) throws DatabricksSQLException {

Nothing is added or removed in these changes. If complex types arrive as binary (which I think is the case), they are supported; otherwise they are not, and this is the current behaviour too.

gopalldb · 2025-09-05T06:56:51Z

+   *     out of bounds
+   */
+  @Override
+  public Object getObject(int columnIndex) throws DatabricksSQLException {


will this work out of box? You return primitive types from ColumnAccessor, and here we can have complex types as well. Will the conversion happen implicitly?

There is a binary type as well. I added more details in this comment: #975 (comment).


Copilot

Pull Request Overview

This PR introduces a memory-efficient columnar data access mechanism for JDBC result processing. Instead of materializing entire result sets into List<List<Object>> structures, it provides direct access to columnar data through a new ColumnarRowView class, resulting in significant memory reduction (up to 91% in testing) and improved CPU performance.

Introduces ColumnarRowView class for memory-efficient row-by-row data access
Updates LazyThriftResult to use columnar views instead of materialized row lists
Adds utility method in DatabricksThriftUtil for creating columnar views

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 1 comment.

| File | Description |
| --- | --- |
| `ColumnarRowView.java` | New class providing direct columnar access with `getValue(row, col)` method and null handling |
| `LazyThriftResult.java` | Refactored to use `ColumnarRowView` instead of `List<List<Object>>` for batch processing |
| `DatabricksThriftUtil.java` | Added `createColumnarView()` utility method as memory-efficient alternative |
| `ColumnarRowViewTest.java` | Comprehensive test coverage for all column types, null handling, and boundary conditions |


Copilot · 2025-09-11T11:09:44Z

+    private final BitSet nullBits;
+
+    public TypedColumnAccessor(List<T> values, byte[] nulls) {
+      this.values = values;
+      this.nullBits = nulls != null ? BitSet.valueOf(nulls) : null;
+    }
+
+    @Override
+    public Object getValue(int rowIndex) {
+      if (nullBits != null && nullBits.get(rowIndex)) {
+        return null;


Creating a new BitSet for every column accessor could be memory-intensive for large datasets. Consider lazy initialization or caching the BitSet creation to improve memory efficiency.

Suggested change

-    private final BitSet nullBits;
-
-    public TypedColumnAccessor(List<T> values, byte[] nulls) {
-      this.values = values;
-      this.nullBits = nulls != null ? BitSet.valueOf(nulls) : null;
-    }
-
-    @Override
-    public Object getValue(int rowIndex) {
-      if (nullBits != null && nullBits.get(rowIndex)) {
-        return null;
+    private final byte[] nulls;
+    private BitSet nullBits;
+
+    public TypedColumnAccessor(List<T> values, byte[] nulls) {
+      this.values = values;
+      this.nulls = nulls;
+      this.nullBits = null; // Lazy initialization
+    }
+
+    @Override
+    public Object getValue(int rowIndex) {
+      if (nulls != null) {
+        if (nullBits == null) {
+          nullBits = BitSet.valueOf(nulls);
+        }
+        if (nullBits.get(rowIndex)) {
+          return null;
+        }

Nice suggestion. Thanks. Will implement in subsequent PR.
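
For reference, a self-contained sketch of the lazy variant discussed above. The class name `LazyNullAccessorSketch` and the `isBitSetCreated()` helper are hypothetical and added only so the deferred creation can be observed; this is not the `TypedColumnAccessor` from the PR. It relies on the documented behaviour of `BitSet.valueOf(byte[])`, where bit `n` corresponds to `bytes[n / 8] & (1 << (n % 8))`.

```java
import java.util.Arrays;
import java.util.BitSet;
import java.util.List;

/** Hypothetical accessor illustrating deferred BitSet creation. */
final class LazyNullAccessorSketch<T> {
  private final List<T> values;
  private final byte[] nulls; // raw null bitmap; may be null when no value is null
  private BitSet nullBits; // built on first read; this sketch is not thread-safe

  LazyNullAccessorSketch(List<T> values, byte[] nulls) {
    this.values = values;
    this.nulls = nulls;
  }

  /** Illustration-only helper to observe whether the BitSet exists yet. */
  boolean isBitSetCreated() {
    return nullBits != null;
  }

  Object getValue(int rowIndex) {
    if (nulls != null) {
      if (nullBits == null) {
        nullBits = BitSet.valueOf(nulls); // created only when a cell is first read
      }
      if (nullBits.get(rowIndex)) {
        return null;
      }
    }
    return values.get(rowIndex);
  }

  public static void main(String[] args) {
    // 0b00000101 sets bits 0 and 2, so rows 0 and 2 are treated as null.
    LazyNullAccessorSketch<String> accessor =
        new LazyNullAccessorSketch<>(Arrays.asList("x", "y", "z"), new byte[] {0b00000101});
    System.out.println(accessor.isBitSetCreated()); // BitSet not built before first read
    System.out.println(accessor.getValue(0));
    System.out.println(accessor.getValue(1));
    System.out.println(accessor.isBitSetCreated());
  }
}
```

One trade-off worth noting: once a cell has been read, this version holds both the `byte[]` and the `BitSet`, so the saving applies only to columns that are never accessed.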

## Description

This PR introduces lazy loading support for inline Arrow results to improve memory efficiency when handling large result sets. Previously, InlineChunkProvider would eagerly fetch all arrow batches upfront when results had hasMoreRows = true, which could lead to memory issues with large datasets.

This change splits the handling into two separate paths:

1. Lazy path (new): For Thrift-based inline Arrow results (when ARROW_BASED_SET is returned), we now use LazyThriftInlineArrowResult which fetches arrow batches on-demand as the client iterates through rows. This is similar to how LazyThriftResult works for columnar data.
2. Remote path (existing): For URL-based Arrow results (URL_BASED_SET), we continue using ArrowStreamResult with RemoteChunkProvider which downloads chunks from cloud storage.

The InlineChunkProvider is now only used for SEA results with JSON_ARRAY format and INLINE disposition (contain all data inline {no hasMoreRows flag set}). This will reduce memory consumption and improve performance when dealing with large inline Arrow result sets similar to #975.

## Testing

- Unit tests
- Integration tests
- Manual testing

## Additional Notes to the Reviewer

Bypassing an existing failure on CI/CD because of 3e4f21c
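
The on-demand fetching described in the lazy path can be sketched generically as an iterator that requests the next batch only when the current one is exhausted. This is an illustration under assumptions, not the code of LazyThriftInlineArrowResult or LazyThriftResult: the class `LazyBatchIteratorSketch`, the `Supplier`-based fetch callback, and the convention that an empty batch signals the end of results are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Supplier;

/** Hypothetical iterator that pulls batches only when the caller needs more rows. */
final class LazyBatchIteratorSketch<T> implements Iterator<T> {
  private final Supplier<List<T>> fetchNextBatch; // assumed to return an empty list at the end
  private Iterator<T> current = Collections.emptyIterator();
  private boolean exhausted;
  private int fetchCount;

  LazyBatchIteratorSketch(Supplier<List<T>> fetchNextBatch) {
    this.fetchNextBatch = fetchNextBatch;
  }

  /** Illustration-only counter of how many fetch calls have been made. */
  int getFetchCount() {
    return fetchCount;
  }

  @Override
  public boolean hasNext() {
    // Fetch a new batch only when the current one has been fully consumed.
    while (!current.hasNext() && !exhausted) {
      List<T> batch = fetchNextBatch.get();
      fetchCount++;
      if (batch.isEmpty()) {
        exhausted = true;
      } else {
        current = batch.iterator();
      }
    }
    return current.hasNext();
  }

  @Override
  public T next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    return current.next();
  }

  public static void main(String[] args) {
    // Stand-in for a server: two batches, then nothing.
    Deque<List<Integer>> server = new ArrayDeque<>();
    server.add(Arrays.asList(1, 2));
    server.add(Arrays.asList(3));

    LazyBatchIteratorSketch<Integer> rows =
        new LazyBatchIteratorSketch<>(
            () -> server.isEmpty() ? Collections.<Integer>emptyList() : server.poll());
    while (rows.hasNext()) {
      System.out.println(rows.next() + " (fetches so far: " + rows.getFetchCount() + ")");
    }
  }
}
```

At most one batch is held at a time in this sketch, in contrast to the eager behaviour described above, where all batches are fetched upfront.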

jayantsing-db added 5 commits August 28, 2025 00:14

Introduce lazy fetch requests

7afdb22

temp benchmarking

76e36e3

Revert "temp benchmarking"

29197e8

This reverts commit 76e36e3.

Column view

c7e8ea7

Merge remote-tracking branch 'databricks/main' into fetch-requests-co…

e09688c

…lumnar-view

gopalldb reviewed Sep 5, 2025

jayantsing-db added 2 commits September 5, 2025 17:36

Merge remote-tracking branch 'databricks/main' into fetch-requests-co…

40df7de

…lumnar-view

Merge remote-tracking branch 'databricks/main' into fetch-requests-co…

f8cb9a8

…lumnar-view

gopalldb approved these changes Sep 10, 2025

jayantsing-db enabled auto-merge (squash) September 10, 2025 12:42

Merge remote-tracking branch 'databricks/main' into fetch-requests-co…

c5bb7cb

…lumnar-view

jayantsing-db requested a review from Copilot September 11, 2025 11:09

Copilot AI reviewed Sep 11, 2025

Next changelog

61d9702

jayantsing-db merged commit 43ac32b into databricks:main Sep 11, 2025
12 of 13 checks passed

jayantsing-db mentioned this pull request Sep 30, 2025

Implement lazy loading for inline Arrow results #1029

Merged

jayantsing-db merged 9 commits into databricks:main from jayantsing-db:jayantsing-db/fetch-requests-columnar-view


3 participants
